Open source optical character recognition for historical research

Internal identifier: 000303 (Main/Exploration); previous: 000302; next: 000304

Authors: Tobias Blanke [United Kingdom]; Michael Bryant [United Kingdom]; Mark Hedges [United Kingdom]

Source :

Journal of documentation [ 0022-0418 ] ; 2012.

RBID : Pascal:13-0104039

French descriptors

Pascal (Inist)
- Reconnaissance optique caractère, Workflow, Collection, Archives, Archive, Bibliothèque électronique, Open source.
Wicri :
- topic : Archives.

English descriptors

KwdEn :
- Archive, Archives, Collection, Electronic library, Open source, Optical character recognition, Workflow.

Abstract

Purpose - This paper aims to present an evaluation of open source OCR for supporting research on material in small- to medium-scale historical archives. Design/methodology/approach - The approach was to develop a workflow engine to support the easy customisation of the OCR process towards the historical materials using open source technologies. Commercial OCR often fails to deliver sufficient results here, as its processing is optimised towards large-scale commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools. Findings - The authors demonstrate their application and its flexibility and present two case studies, which demonstrate how OCR can be embedded into wider digitally enabled historical research. The first case study produces high-quality research-oriented digitisation outputs, utilising services that the authors developed to allow for the direct linkage of digitisation image and OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can be added easily to the workflow, enhancing the research browse experience significantly. Originality/value - There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre-processing and layout analysis. All this can be done without the need to develop dedicated code.

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000059
to stream PascalFrancis, to step Corpus: 000074
to stream PascalFrancis, to step Curation: 000709
to stream PascalFrancis, to step Checkpoint: 000065
to stream Main, to step Merge: 000306
to stream Main, to step Curation: 000303

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Open source optical character recognition for historical research</title>
<author><name sortKey="Blanke, Tobias" sort="Blanke, Tobias" uniqKey="Blanke T" first="Tobias" last="Blanke">Tobias Blanke</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName><settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Bryant, Michael" sort="Bryant, Michael" uniqKey="Bryant M" first="Michael" last="Bryant">Michael Bryant</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName><settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Hedges, Mark" sort="Hedges, Mark" uniqKey="Hedges M" first="Mark" last="Hedges">Mark Hedges</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName><settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">13-0104039</idno>
<date when="2012">2012</date>
<idno type="stanalyst">PASCAL 13-0104039 INIST</idno>
<idno type="RBID">Pascal:13-0104039</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000059</idno>
<idno type="stanalyst">FRANCIS 13-0104039 INIST</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000074</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000709</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000065</idno>
<idno type="wicri:doubleKey">0022-0418:2012:Blanke T:open:source:optical</idno>
<idno type="wicri:Area/Main/Merge">000306</idno>
<idno type="wicri:Area/Main/Curation">000303</idno>
<idno type="wicri:Area/Main/Exploration">000303</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Open source optical character recognition for historical research</title>
<author><name sortKey="Blanke, Tobias" sort="Blanke, Tobias" uniqKey="Blanke T" first="Tobias" last="Blanke">Tobias Blanke</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName><settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Bryant, Michael" sort="Bryant, Michael" uniqKey="Bryant M" first="Michael" last="Bryant">Michael Bryant</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName><settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Hedges, Mark" sort="Hedges, Mark" uniqKey="Hedges M" first="Mark" last="Hedges">Mark Hedges</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName><settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Journal of documentation</title>
<title level="j" type="abbreviated">J. doc.</title>
<idno type="ISSN">0022-0418</idno>
<imprint><date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Journal of documentation</title>
<title level="j" type="abbreviated">J. doc.</title>
<idno type="ISSN">0022-0418</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Archive</term>
<term>Archives</term>
<term>Collection</term>
<term>Electronic library</term>
<term>Open source</term>
<term>Optical character recognition</term>
<term>Workflow</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance optique caractère</term>
<term>Workflow</term>
<term>Collection</term>
<term>Archives</term>
<term>Archive</term>
<term>Bibliothèque électronique</term>
<term>Open source</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Archives</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Purpose - This paper aims to present an evaluation of open source OCR for supporting research on material in small- to medium-scale historical archives. Design/methodology/approach The approach was to develop a workflow engine to support the easy customisation of the OCR process towards the historical materials using open source technologies. Commercial OCR often fails to deliver sufficient results here, as their processing is optimised towards large-scale commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools. Findings - The authors demonstrate their application and its flexibility and present two case studies, which demonstrate how OCR can be embedded into wider digitally enabled historical research. The first case study produces high-quality research-oriented digitisation outputs, utilizing services that the authors developed to allow for the direct linkage of digitisation image and OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can be added easily to the workflow, enhancing the research browse experience significantly. Originality/value - There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre-processing and layout analysis. All this can be done without the need to develop dedicated code.</div>
</front>
</TEI>
<affiliations><list><country><li>Royaume-Uni</li>
</country>
<region><li>Angleterre</li>
<li>Grand Londres</li>
</region>
<settlement><li>Londres</li>
</settlement>
</list>
<tree><country name="Royaume-Uni"><region name="Angleterre"><name sortKey="Blanke, Tobias" sort="Blanke, Tobias" uniqKey="Blanke T" first="Tobias" last="Blanke">Tobias Blanke</name>
</region>
<name sortKey="Bryant, Michael" sort="Bryant, Michael" uniqKey="Bryant M" first="Michael" last="Bryant">Michael Bryant</name>
<name sortKey="Hedges, Mark" sort="Hedges, Mark" uniqKey="Hedges M" first="Mark" last="Hedges">Mark Hedges</name>
</country>
</tree>
</affiliations>
</record>
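
As a rough illustration of how the record above might be read programmatically, the following Python sketch extracts the title and author names from an abbreviated copy of the record. It is only a sketch under stated assumptions: the record uses the wicri: and inist: prefixes without declaring them, so the fragment is wrapped in a hypothetical wrapper element that binds them to placeholder URIs (these URIs are invented for the example and are not part of the record). The function name parse_record is likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the record above (one author kept for brevity).
RECORD_FRAGMENT = """
<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">Open source optical character recognition for historical research</title>
<author><name sortKey="Blanke, Tobias" first="Tobias" last="Blanke">Tobias Blanke</name>
<affiliation wicri:level="3"><country>Royaume-Uni</country></affiliation>
</author>
</titleStmt></fileDesc></teiHeader></TEI></record>
"""

# Assumption: placeholder namespace URIs, needed only so that the
# undeclared wicri: and inist: prefixes can be parsed at all.
WRAPPER_OPEN = (
    '<wrapper xmlns:wicri="urn:example:wicri" '
    'xmlns:inist="urn:example:inist">'
)


def parse_record(fragment):
    """Return (title, [(last, first), ...]) from a record fragment."""
    root = ET.fromstring(WRAPPER_OPEN + fragment + "</wrapper>")
    stmt = root.find(".//titleStmt")
    title = stmt.findtext("title")
    authors = [
        (name.get("last"), name.get("first"))
        for name in stmt.iterfind("author/name")
    ]
    return title, authors


title, authors = parse_record(RECORD_FRAGMENT)
print(title)
print(authors)
```

Reading only the titleStmt element avoids counting each author twice, since the same author entries are repeated under sourceDesc.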

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000303 | SxmlIndent | more

Or, with EXPLOR_AREA set to $WICRI_ROOT/Ticri/CIDE/explor/OcrV1:

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000303 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:13-0104039
   |texte=   Open source optical character recognition for historical research
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.